In [1]:
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import warnings
from custom import custom_funcs as cf
from datetime import datetime

warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2
%matplotlib inline

Nodes and Edges: How do we represent relationships between individuals using NetworkX?

As mentioned earlier, networks, also known as graphs, are comprised of individual entities and their representatives. The technical term for these are nodes and edges, and when we draw them we typically use circles (nodes) and lines (edges).

In this notebook, we will work with a social network of seventh graders, in which nodes are individual students, and edges represent their relationships. Edges between individuals show how often the seventh graders indicated other seventh graders as their favourite.

Data credit: http://konect.uni-koblenz.de/networks/moreno_seventh

Data Representation

In the networkx implementation, graph objects store their data in dictionaries.

Nodes are part of the attribute Graph.node, which is a dictionary where the key is the node ID and the values are a dictionary of attributes.

Edges are part of the attribute Graph.edge, which is a nested dictionary. Data are accessed as such: G.edge[node1][node2]['attr_name'].

Because of the dictionary implementation of the graph, any hashable object can be a node. This means strings and tuples, but not lists and sets.

Load Data

Let's load some real network data to get a feel for the NetworkX API. This dataset comes from a study of 7th grade students.

This directed network contains proximity ratings between studetns from 29 seventh grade students from a school in Victoria. Among other questions the students were asked to nominate their preferred classmates for three different activities. A node represents a student. An edge between two nodes shows that the left student picked the right student as his answer. The edge weights are between 1 and 3 and show how often the left student chose the right student as his favourite.


In [2]:
#G = cf.load_seventh_grader_network()
G = nx.read_gpickle('Synthetic Social Network.pkl')

Basic Network Statistics

Let's first understand how many students and friendships are represented in the network.


In [3]:
# Who are represented in the network?
G.nodes(data=True)


Out[3]:
[(0, {'age': 20, 'sex': 'Male'}),
 (1, {'age': 21, 'sex': 'Female'}),
 (2, {'age': 19, 'sex': 'Male'}),
 (3, {'age': 29, 'sex': 'Female'}),
 (4, {'age': 30, 'sex': 'Male'}),
 (5, {'age': 26, 'sex': 'Female'}),
 (6, {'age': 21, 'sex': 'Male'}),
 (7, {'age': 17, 'sex': 'Female'}),
 (8, {'age': 21, 'sex': 'Male'}),
 (9, {'age': 14, 'sex': 'Male'}),
 (10, {'age': 23, 'sex': 'Male'}),
 (11, {'age': 17, 'sex': 'Female'}),
 (12, {'age': 19, 'sex': 'Male'}),
 (13, {'age': 27, 'sex': 'Female'}),
 (14, {'age': 29, 'sex': 'Female'}),
 (15, {'age': 14, 'sex': 'Male'}),
 (16, {'age': 18, 'sex': 'Female'}),
 (17, {'age': 21, 'sex': 'Female'}),
 (18, {'age': 19, 'sex': 'Male'}),
 (19, {'age': 19, 'sex': 'Female'}),
 (20, {'age': 19, 'sex': 'Female'}),
 (21, {'age': 21, 'sex': 'Male'}),
 (22, {'age': 30, 'sex': 'Female'}),
 (23, {'age': 25, 'sex': 'Female'}),
 (24, {'age': 13, 'sex': 'Male'}),
 (25, {'age': 24, 'sex': 'Female'}),
 (26, {'age': 23, 'sex': 'Male'}),
 (27, {'age': 21, 'sex': 'Male'}),
 (28, {'age': 29, 'sex': 'Female'}),
 (29, {'age': 25, 'sex': 'Male'})]

Exercise

Can you write a single line of code that returns the number of nodes in the graph?


In [4]:
print(len(G.nodes()))
print(len(G))


30
30

Let's now figure out who is connected to who in the network


In [5]:
# Who is connected to who in the network?
G.edges(data=True)


Out[5]:
[(0, 10, {'date': datetime.datetime(2011, 6, 7, 0, 0)}),
 (0, 19, {'date': datetime.datetime(2011, 2, 12, 0, 0)}),
 (0, 12, {'date': datetime.datetime(2006, 8, 28, 0, 0)}),
 (1, 4, {'date': datetime.datetime(2009, 11, 8, 0, 0)}),
 (1, 2, {'date': datetime.datetime(2010, 8, 5, 0, 0)}),
 (1, 3, {'date': datetime.datetime(2005, 2, 3, 0, 0)}),
 (1, 12, {'date': datetime.datetime(2003, 3, 17, 0, 0)}),
 (1, 29, {'date': datetime.datetime(2005, 1, 15, 0, 0)}),
 (2, 16, {'date': datetime.datetime(2002, 5, 27, 0, 0)}),
 (2, 3, {'date': datetime.datetime(2009, 8, 13, 0, 0)}),
 (2, 6, {'date': datetime.datetime(2006, 1, 12, 0, 0)}),
 (2, 19, {'date': datetime.datetime(2010, 1, 6, 0, 0)}),
 (3, 8, {'date': datetime.datetime(2010, 6, 22, 0, 0)}),
 (3, 6, {'date': datetime.datetime(2009, 3, 20, 0, 0)}),
 (3, 23, {'date': datetime.datetime(2003, 11, 9, 0, 0)}),
 (4, 19, {'date': datetime.datetime(2007, 12, 4, 0, 0)}),
 (4, 28, {'date': datetime.datetime(2009, 5, 22, 0, 0)}),
 (6, 23, {'date': datetime.datetime(2011, 3, 4, 0, 0)}),
 (7, 24, {'date': datetime.datetime(2004, 9, 24, 0, 0)}),
 (7, 25, {'date': datetime.datetime(2009, 3, 21, 0, 0)}),
 (8, 17, {'date': datetime.datetime(2005, 11, 16, 0, 0)}),
 (8, 22, {'date': datetime.datetime(2010, 1, 22, 0, 0)}),
 (9, 24, {'date': datetime.datetime(2008, 12, 2, 0, 0)}),
 (9, 17, {'date': datetime.datetime(2009, 10, 11, 0, 0)}),
 (9, 11, {'date': datetime.datetime(2005, 4, 3, 0, 0)}),
 (10, 11, {'date': datetime.datetime(2005, 2, 6, 0, 0)}),
 (10, 21, {'date': datetime.datetime(2007, 1, 21, 0, 0)}),
 (11, 14, {'date': datetime.datetime(2010, 4, 28, 0, 0)}),
 (12, 19, {'date': datetime.datetime(2007, 12, 17, 0, 0)}),
 (12, 29, {'date': datetime.datetime(2008, 8, 27, 0, 0)}),
 (13, 16, {'date': datetime.datetime(2005, 5, 14, 0, 0)}),
 (13, 24, {'date': datetime.datetime(2006, 5, 7, 0, 0)}),
 (13, 14, {'date': datetime.datetime(2011, 3, 19, 0, 0)}),
 (14, 17, {'date': datetime.datetime(2008, 10, 17, 0, 0)}),
 (14, 25, {'date': datetime.datetime(2002, 6, 11, 0, 0)}),
 (15, 24, {'date': datetime.datetime(2007, 9, 2, 0, 0)}),
 (15, 28, {'date': datetime.datetime(2008, 3, 6, 0, 0)}),
 (16, 17, {'date': datetime.datetime(2002, 5, 20, 0, 0)}),
 (16, 19, {'date': datetime.datetime(2005, 8, 20, 0, 0)}),
 (17, 19, {'date': datetime.datetime(2006, 10, 13, 0, 0)}),
 (19, 22, {'date': datetime.datetime(2011, 11, 4, 0, 0)}),
 (19, 27, {'date': datetime.datetime(2009, 7, 27, 0, 0)}),
 (20, 27, {'date': datetime.datetime(2004, 1, 27, 0, 0)}),
 (20, 23, {'date': datetime.datetime(2007, 12, 3, 0, 0)}),
 (21, 27, {'date': datetime.datetime(2007, 2, 1, 0, 0)}),
 (21, 26, {'date': datetime.datetime(2006, 12, 14, 0, 0)}),
 (25, 28, {'date': datetime.datetime(2007, 2, 15, 0, 0)}),
 (26, 29, {'date': datetime.datetime(2006, 12, 19, 0, 0)})]

Exercise

Can you write a single line of code that returns the number of relationships represented?


In [6]:
print(len(G.edges()))


48

Concept

A network, more technically known as a graph, is comprised of:

  • a set of nodes
  • joined by a set of edges

They can be represented as two lists:

  1. A node list: a list of 2-tuples where the first element of each tuple is the representation of the node, and the second element is a dictionary of metadata associated with the node.
  2. An edge list: a list of 3-tuples where the first two elements are the nodes that are connected together, and the third element is a dictionary of metadata associated with the edge.

Since this is a social network of people, there'll be attributes for each individual, such as a student's gender. We can grab that data off from the attributes that are stored with each node.


In [7]:
# Let's get a list of nodes with their attributes.
G.nodes(data=True)

# NetworkX will return a list of tuples in the form (node_id, attribute_dictionary)


Out[7]:
[(0, {'age': 20, 'sex': 'Male'}),
 (1, {'age': 21, 'sex': 'Female'}),
 (2, {'age': 19, 'sex': 'Male'}),
 (3, {'age': 29, 'sex': 'Female'}),
 (4, {'age': 30, 'sex': 'Male'}),
 (5, {'age': 26, 'sex': 'Female'}),
 (6, {'age': 21, 'sex': 'Male'}),
 (7, {'age': 17, 'sex': 'Female'}),
 (8, {'age': 21, 'sex': 'Male'}),
 (9, {'age': 14, 'sex': 'Male'}),
 (10, {'age': 23, 'sex': 'Male'}),
 (11, {'age': 17, 'sex': 'Female'}),
 (12, {'age': 19, 'sex': 'Male'}),
 (13, {'age': 27, 'sex': 'Female'}),
 (14, {'age': 29, 'sex': 'Female'}),
 (15, {'age': 14, 'sex': 'Male'}),
 (16, {'age': 18, 'sex': 'Female'}),
 (17, {'age': 21, 'sex': 'Female'}),
 (18, {'age': 19, 'sex': 'Male'}),
 (19, {'age': 19, 'sex': 'Female'}),
 (20, {'age': 19, 'sex': 'Female'}),
 (21, {'age': 21, 'sex': 'Male'}),
 (22, {'age': 30, 'sex': 'Female'}),
 (23, {'age': 25, 'sex': 'Female'}),
 (24, {'age': 13, 'sex': 'Male'}),
 (25, {'age': 24, 'sex': 'Female'}),
 (26, {'age': 23, 'sex': 'Male'}),
 (27, {'age': 21, 'sex': 'Male'}),
 (28, {'age': 29, 'sex': 'Female'}),
 (29, {'age': 25, 'sex': 'Male'})]

Exercise

Can you count how many males and females are represented in the graph?

Hint: You may want to use the Counter object from the collections module.


In [8]:
from collections import Counter
mf_counts = Counter([d['sex'] for n, d in G.nodes(data=True)])

def test_answer(mf_counts):
    print(mf_counts['Female'])
    print(mf_counts['Male'])
    
test_answer(mf_counts)


15
15

Edges can also store attributes in their attribute dictionary.


In [9]:
G.edges(data=True)


Out[9]:
[(0, 10, {'date': datetime.datetime(2011, 6, 7, 0, 0)}),
 (0, 19, {'date': datetime.datetime(2011, 2, 12, 0, 0)}),
 (0, 12, {'date': datetime.datetime(2006, 8, 28, 0, 0)}),
 (1, 4, {'date': datetime.datetime(2009, 11, 8, 0, 0)}),
 (1, 2, {'date': datetime.datetime(2010, 8, 5, 0, 0)}),
 (1, 3, {'date': datetime.datetime(2005, 2, 3, 0, 0)}),
 (1, 12, {'date': datetime.datetime(2003, 3, 17, 0, 0)}),
 (1, 29, {'date': datetime.datetime(2005, 1, 15, 0, 0)}),
 (2, 16, {'date': datetime.datetime(2002, 5, 27, 0, 0)}),
 (2, 3, {'date': datetime.datetime(2009, 8, 13, 0, 0)}),
 (2, 6, {'date': datetime.datetime(2006, 1, 12, 0, 0)}),
 (2, 19, {'date': datetime.datetime(2010, 1, 6, 0, 0)}),
 (3, 8, {'date': datetime.datetime(2010, 6, 22, 0, 0)}),
 (3, 6, {'date': datetime.datetime(2009, 3, 20, 0, 0)}),
 (3, 23, {'date': datetime.datetime(2003, 11, 9, 0, 0)}),
 (4, 19, {'date': datetime.datetime(2007, 12, 4, 0, 0)}),
 (4, 28, {'date': datetime.datetime(2009, 5, 22, 0, 0)}),
 (6, 23, {'date': datetime.datetime(2011, 3, 4, 0, 0)}),
 (7, 24, {'date': datetime.datetime(2004, 9, 24, 0, 0)}),
 (7, 25, {'date': datetime.datetime(2009, 3, 21, 0, 0)}),
 (8, 17, {'date': datetime.datetime(2005, 11, 16, 0, 0)}),
 (8, 22, {'date': datetime.datetime(2010, 1, 22, 0, 0)}),
 (9, 24, {'date': datetime.datetime(2008, 12, 2, 0, 0)}),
 (9, 17, {'date': datetime.datetime(2009, 10, 11, 0, 0)}),
 (9, 11, {'date': datetime.datetime(2005, 4, 3, 0, 0)}),
 (10, 11, {'date': datetime.datetime(2005, 2, 6, 0, 0)}),
 (10, 21, {'date': datetime.datetime(2007, 1, 21, 0, 0)}),
 (11, 14, {'date': datetime.datetime(2010, 4, 28, 0, 0)}),
 (12, 19, {'date': datetime.datetime(2007, 12, 17, 0, 0)}),
 (12, 29, {'date': datetime.datetime(2008, 8, 27, 0, 0)}),
 (13, 16, {'date': datetime.datetime(2005, 5, 14, 0, 0)}),
 (13, 24, {'date': datetime.datetime(2006, 5, 7, 0, 0)}),
 (13, 14, {'date': datetime.datetime(2011, 3, 19, 0, 0)}),
 (14, 17, {'date': datetime.datetime(2008, 10, 17, 0, 0)}),
 (14, 25, {'date': datetime.datetime(2002, 6, 11, 0, 0)}),
 (15, 24, {'date': datetime.datetime(2007, 9, 2, 0, 0)}),
 (15, 28, {'date': datetime.datetime(2008, 3, 6, 0, 0)}),
 (16, 17, {'date': datetime.datetime(2002, 5, 20, 0, 0)}),
 (16, 19, {'date': datetime.datetime(2005, 8, 20, 0, 0)}),
 (17, 19, {'date': datetime.datetime(2006, 10, 13, 0, 0)}),
 (19, 22, {'date': datetime.datetime(2011, 11, 4, 0, 0)}),
 (19, 27, {'date': datetime.datetime(2009, 7, 27, 0, 0)}),
 (20, 27, {'date': datetime.datetime(2004, 1, 27, 0, 0)}),
 (20, 23, {'date': datetime.datetime(2007, 12, 3, 0, 0)}),
 (21, 27, {'date': datetime.datetime(2007, 2, 1, 0, 0)}),
 (21, 26, {'date': datetime.datetime(2006, 12, 14, 0, 0)}),
 (25, 28, {'date': datetime.datetime(2007, 2, 15, 0, 0)}),
 (26, 29, {'date': datetime.datetime(2006, 12, 19, 0, 0)})]

In [10]:
lst = [c['date'] for a, b, c in G.edges(data=True)]

#def test_answer(mf_counts):
#    print(mf_counts['Female'])
#    print(mf_counts['Male'])
    
#test_answer(mf_counts)
print(min(lst), max(lst))


2002-05-20 00:00:00 2011-11-04 00:00:00

In this synthetic social network, the number of times the left student indicated that the right student was their favourite is stored in the "count" variable.

Exercise

Can you figure out the maximum times any student rated another student as their favourite?


In [11]:
# Answer
counts = [d['_____'] for _, _, _ in G._______(_________)]
maxcount = max(_________)

def test_maxcount(maxcount):
    assert maxcount == 3
    
test_maxcount(maxcount)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-5e4fc27d9590> in <module>()
      1 # Answer
----> 2 counts = [d['_____'] for _, _, _ in G._______(_________)]
      3 maxcount = max(_________)
      4 
      5 def test_maxcount(maxcount):

AttributeError: 'Graph' object has no attribute '_______'

Exercise

We found out that there are two individuals that we left out of the network, individual no. 30 and 31. They are one male (30) and one female (31), and they are a pair that just love hanging out with one another and with individual 7 (count=3). Add this information to the graph.

If you need more help, check out https://networkx.github.io/documentation/latest/tutorial/tutorial.html


In [12]:
# Answer: Follow the coding pattern.
G.add_node(30, {'age': 22, 'sex': 'Male'})
G.add_node(31, {'age': 24, 'sex': 'Female'})

G.add_edge(30, 31, date= datetime(2010, 1, 9))
G.add_edge(30, 7, date= datetime(2009, 12, 11))
G.add_edge(31, 7, date= datetime(2009, 12, 11))

G.edge[30]


Out[12]:
{7: {'date': datetime.datetime(2009, 12, 11, 0, 0)},
 31: {'date': datetime.datetime(2010, 1, 9, 0, 0)}}

Verify that you have added in the edges and nodes correctly by running the following cell.


In [13]:
def test_graph_integrity(G):
    assert 30 in G.nodes()
    assert 31 in G.nodes()
    assert G.node[30]['sex'] == 'Male'
    assert G.node[31]['sex'] == 'Female'
    assert G.has_edge(30, 31)
    assert G.has_edge(30, 7)
    assert G.has_edge(31, 7)
    print('All tests passed.')
    
test_graph_integrity(G)


All tests passed.

Tests

A note about the tests: Testing is good practice when writing code. Well-crafted assertion statements help you program defensivel, by forcing you to explicitly state your assumptions about the code or data.

For more references on defensive programming, check out Software Carpentry's website: http://swcarpentry.github.io/python-novice-inflammation/08-defensive.html

For more information on writing tests for your data, check out these slides from a lightning talk I gave at Boston Python and SciPy 2015: http://j.mp/data-test

Coding Patterns

These are some recommended coding patterns when doing network analysis using NetworkX, which stem from my roughly two years of experience with the package.

Iterating using List Comprehensions

I would recommend that you use the following for compactness:

[d['attr'] for n, d in G.nodes(data=True)]

And if the node is unimportant, you can do:

[d['attr'] for _, d in G.nodes(data=True)]

Iterating over Edges using List Comprehensions

A similar pattern can be used for edges:

[n2 for n1, n2, d in G.edges(data=True)]

or

[n2 for _, n2, d in G.edges(data=True)]

If the graph you are constructing is a directed graph, with a "source" and "sink" available, then I would recommend the following pattern:

[(sc, sk) for sc, sk, d in G.edges(data=True)]

or

[d['attr'] for sc, sk, d in G.edges(data=True)]

Drawing Graphs

As illustrated above, we can draw graphs using the nx.draw() function. The most popular format for drawing graphs is the node-link diagram.


In [14]:
nx.draw(G)


If the network is small enough to visualize, and the node labels are small enough to fit in a circle, then you can use the with_labels=True argument.


In [15]:
nx.draw(G, with_labels=True)


However, note that if the number of nodes in the graph gets really large, node-link diagrams can begin to look like massive hairballs. This is undesirable for graph visualization.

Instead, we can use a matrix to represent them. The nodes are on the x- and y- axes, and a filled square represent an edge between the nodes. This is done by using the nx.to_numpy_matrix(G) function.

We then use matplotlib's pcolor(numpy_array) function to plot. Because pcolor cannot take in numpy matrices, we will cast the matrix as an array of arrays, and then get pcolor to plot it.


In [16]:
matrix = nx.to_numpy_matrix(G)

plt.pcolor(np.array(matrix))
plt.axes().set_aspect('equal') # set aspect ratio equal to get a square visualization
plt.xlim(min(G.nodes()), max(G.nodes())) # set x and y limits to the number of nodes present.
plt.ylim(min(G.nodes()), max(G.nodes()))
plt.title('Adjacency Matrix')
plt.show()


Let's try another visualization, the Circos plot. We can order the nodes in the Circos plot according to the node ID, but any other ordering is possible as well. Edges are drawn between two nodes.

Credit goes to Justin Zabilansky (MIT) for the implementation, and Jon Charest for improvements.


In [17]:
from circos import CircosPlot

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

nodes = sorted(G.nodes())
edges = G.edges()
node_cmap = {'Male':'blue', 'Female':'red'}
nodecolors = [node_cmap[G.node[n]['sex']] for n in G.nodes()]

c = CircosPlot(nodes, edges, radius=10, ax=ax, fig=fig, nodecolor=nodecolors)
c.draw()  
plt.savefig('images/seventh.png', dpi=300)


This visualization helps us highlight nodes that there are poorly connected, and others that are strongly connected.

Next up, let's try Hive Plots.


In [18]:
from hiveplot import HivePlot

nodes = dict()
nodes['Male'] = [n for n,d in G.nodes(data=True) if d['sex'] == 'Male']
nodes['Female'] = [n for n,d in G.nodes(data=True) if d['sex'] == 'Female']

edges = dict()
edges['group1'] = G.edges(data=True)

nodes_cmap = dict()
nodes_cmap['Male'] = 'blue'
nodes_cmap['Female'] = 'red'

edges_cmap = dict()
edges_cmap['group1'] = 'black'

In [19]:
h = HivePlot(nodes, edges, nodes_cmap, edges_cmap)
h.draw()


Hive plots allow us to divide our nodes into sub-groups, and visualize the within- and between-group connectivity.


In [ ]: